Foundations of Data

Module 1.3: Using R

Alex Cardazzi

Old Dominion University

More R

By now, you can probably use R like a calculator – adding and subtracting single numbers, etc.


Calling this thing a ‘phone’ is like calling a Lamborghini a… cupholder. An incredibly elaborate cupholder.


However, there are a lot of features that make R the Lamborghini of calculators.

Variables

In R, you can assign names to values (remember, these are objects). You do this by using either <- or =. When searching online, you will find solutions that use both. Despite what you might read, there are differences between the two, but we can ignore those differences for right now.

Why would you want to assign names to values? This allows your code to be much more flexible. Consider the following example.

Code
5 + 3
5 / 10
as.character(5)
Output
[1] 8
[1] 0.5
[1] "5"
Code
x <- 5
x + 3
x / 10
as.character(x)
Output
[1] 8
[1] 0.5
[1] "5"

Naming 5 as x means that if the value changes, we only need to edit one line, and the rest of the code still runs. This will 1) reduce our effort, 2) decrease typos and bugs, and 3) increase readability. Now, x is not readable, per se, but this is just an example.

Variables

This may seem like a simple point, but it is very important. If you manipulate a variable in any way, but do not re-assign it to a name (same or different), it does not get updated/saved. Consider the following example.

Code
x <- 10
x + 5

x
Output
[1] 15
[1] 10
Code
x <- 10
x <- x + 5

x
Output
[1] 15
Code
x <- 10
y <- x + 5
x
y
Output
[1] 10
[1] 15

Naming Variables

It is important to choose informative names for your variables. Generally, single (or few) character names are easy to type, but can easily lose meaning. Too-long names aren’t great if you need to type them over and over. You will figure out a sweet spot for yourself.

There are some names you cannot use for your variable names, and other names that you simply shouldn’t. For example, you cannot start a variable name with a number. You cannot start names with certain punctuation either. On the other hand, you should not name things after already-used words that are native to R. This will just lead to confusing code. For example, do not name anything mean, because that is already a function name that is native to R.

Learning what is and what is not a good variable name takes time and practice.
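To make these rules a bit more concrete, here is a small sketch. The names avg_price and prices are made up for illustration, and the invalid assignment is commented out because R would refuse to run it.

```r
# A descriptive name: easy to understand when you come back to it later
avg_price <- 10500

# Names cannot start with a number.
# Uncommenting the next line produces a syntax error:
# 2nd_price <- 9900

# This is allowed, but confusing: "mean" is now a number AND a function
mean <- 5
prices <- c(9991, 9925, 10491)
mean(prices) # R still finds the function here, but a reader may not follow

# Remove our variable so only the built-in function is left
rm(mean)
```

The code runs either way, so this is about keeping your script readable rather than about avoiding an error.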

Collections

So far, we have only worked with single values. Data tends to come in sets of multiple values, like large spreadsheets with columns and rows. Let’s build up to the R version of “spreadsheets”, which are called data.frames. We will touch on each of the following ways to store multiple values:

  • Vectors
  • Matrices
  • Lists
  • data.frame

Vectors

In R, a vector is a collection of values that are all of the same type. We use c() to create vectors. The c stands for combine. Once we have our vector, we can do different things to it. For example, we know how to add two values, but what about a vector and a single value? Or two vectors?

Code
vec1 <- c(1, 3, 5)
vec2 <- c(2, 4, 6)
vec1 + 10
vec1 + vec2
Output
[1] 11 13 15
[1]  3  7 11

Vectors

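Notice how when we added 10 to vec1, 10 was added to each element of vec1. However, when we added the two vectors, addition was element-wise. If two vectors are of different lengths, R will “recycle” the shorter one to match the longer one. R also prints a warning when the longer length is not a multiple of the shorter one, as is the case with the lengths 3 and 4 below.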

Code
vec1 <- c(1, 2, 3)
vec2 <- c(1, 2, 3, 4)
vec1 + vec2
Output
[1] 2 4 6 5
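As a rough sketch of how recycling behaves, compare a case where the lengths line up with one where they do not. The vector names short, long, and odd are made up for this example.

```r
short <- c(1, 2)
long  <- c(10, 20, 30, 40)

# 4 is a multiple of 2, so "short" is reused exactly twice: 1, 2, 1, 2
short + long # 11 22 31 42

# 4 is not a multiple of 3: R still recycles (1, 2, 3, 1), but it warns you.
# suppressWarnings() is used here only to keep the output tidy.
odd <- c(1, 2, 3)
result <- suppressWarnings(odd + long)
result # 11 22 33 41
```

If you see that warning in your own work, it is usually a sign that one of your vectors is not the length you expected.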

As a quick aside, R has really good built-in help. To access it, put a ? in front of whatever you want help with. For example, suppose you need help with the mean function from before.

Code
?mean

Running this line (as a reminder: ctrl + enter) will bring you to the function’s documentation.

The documentation shows how to call the function: mean(x, trim = 0, na.rm = FALSE, ...)

  • x, trim, and na.rm are the function’s arguments. These are inputs, and the function gives you an output.
    • x is the vector you want the mean of, for example x <- c(1, 4, 8, 7, 2).
    • trim is the fraction of observations (elements in the vector) to be removed from each end before taking the mean. You might want to remove the top and bottom 5% of observations since they might be outliers.
    • na.rm is a boolean. When it is TRUE, NA (missing) values are removed before the mean is calculated.
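Here is a minimal sketch of these three arguments in action, using the same x as in the list above. The name x_na is made up for the example.

```r
x <- c(1, 4, 8, 7, 2)
mean(x) # 4.4

# trim = 0.2 drops the lowest 20% and the highest 20% of values
# (with 5 elements, that is one value from each end) before averaging
mean(x, trim = 0.2) # the mean of 2, 4, and 7

# A single missing value makes the plain mean NA...
x_na <- c(x, NA)
mean(x_na) # NA

# ...unless na.rm = TRUE removes it first
mean(x_na, na.rm = TRUE) # 4.4
```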

Vectors

R also has different ways (functions) to generate vectors. Here are some shortcuts:

Code
1:4 # This outputs every integer between 1 and 4
rep(1:4, times = 2) # Repeat 1-4 twice
# Function arguments are ordered, so it still works
# even without the "times ="
rep(1:4, 2)
seq(1, 4, by = .5) # Sequence from 1-4 by .5 increments
# 4 numbers drawn from a normal distribution
# with mean 0 and sd 1
rnorm(4, mean = 0, sd = 1)
Output
[1] 1 2 3 4
[1] 1 2 3 4 1 2 3 4
[1] 1 2 3 4 1 2 3 4
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0
[1]  1.7049032 -0.7120386 -0.2779849 -0.1196490
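Since rnorm() draws random numbers, you will get different values every time you run it, and your output will not match the output above. If you want the same "random" numbers each time, one option is set.seed(). The sketch below assumes an arbitrary seed of 311; any whole number works.

```r
set.seed(311) # 311 is an arbitrary choice
first_draw <- rnorm(4)

set.seed(311) # resetting to the same seed...
second_draw <- rnorm(4) # ...gives the same four numbers again

identical(first_draw, second_draw) # TRUE
```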

Vectors

What happens if you have a vector of elements that are of different types?

Code
c(1, 2, "3")
c("1", "2", 3)
Output
[1] "1" "2" "3"
[1] "1" "2" "3"
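In both cases, R quietly converted everything to text so that the vector has a single type. A small sketch of how you might check for this and undo it (the names mixed and fixed are made up):

```r
mixed <- c(1, 2, "3")
class(mixed) # "character"

# mixed + 1 would give an error: you cannot do arithmetic on text

# Convert back to numbers before doing any math
fixed <- as.numeric(mixed)
class(fixed) # "numeric"
fixed + 1 # 2 3 4
```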

Vectors

Let’s suppose you only want a part of a vector. You can select elements from a vector by index (its position in the vector) or by boolean values. You do this by typing the vector’s name, followed by square brackets containing another vector of indices or boolean values. Here are some examples of how to do this:

Code
this_vec <- c(0, 8, 3, 6, 1, 2, 2, 7, 6)
# Select the 3rd, 4th, and 5th observations
this_vec[c(3, 4, 5)]
# Select every other observation
# Note: there are 9 elements, but R cycles through c(TRUE, FALSE)
# until it gets through all 9.
this_vec[c(TRUE, FALSE)]
# Select observations less than 5.
this_vec[this_vec < 5]
# Select observations less than 5 or greater than 7
this_vec[this_vec < 5 | this_vec > 7]
# Select observations less than 5 and greater than 7
# Notice the output here since the logic is impossible
# Something cannot be less than 5 AND greater than 7
this_vec[this_vec < 5 & this_vec > 7]
Output
[1] 3 6 1
[1] 0 3 1 2 6
[1] 0 3 1 2 2
[1] 0 8 3 1 2 2
numeric(0)

Vectors

Another important way to subset vectors is with the %in% operator. Suppose you have a vector of years as follows: c(2006, 2006, 2003, 2005, 2012, 2002, 2016, 2006, 2008). To keep only the elements equal to 2006, 2007, or 2008, you would have to write the following:

Code
v <- c(2006, 2006, 2003, 2005, 2012, 2002, 2016, 2006, 2008)
v[v == 2006 | v == 2007 | v == 2008]
Output
[1] 2006 2006 2006 2008

This can be very tedious, is prone to typos, and becomes infeasible if the list is much longer (i.e., not just three years). As a shortcut, R has the following:

Code
v <- c(2006, 2006, 2003, 2005, 2012, 2002, 2016, 2006, 2008)
v[v %in% c(2006, 2007, 2008)]
# You could also do the following:
#   wants <- c(2006, 2007, 2008)
#   v[v %in% wants]
Output
[1] 2006 2006 2006 2008

Matrices

A collection of vectors (of the same type and length) is called a matrix. Matrices have two dimensions: rows and columns. Matrices are indexed like: example_mat[rows,cols]. You can combine vectors using rbind() (row-wise) or cbind() (column-wise). Let’s start by assuming you have a few vectors to work with.

Code
v1 <- rnorm(4) # 4 random numbers
v2 <- 1:4 # 1 - 4
v3 <- 9:6 # 9 - 6
cbind(v1, v2, v3); cat("\n")
rbind(v1, v2, v3)
Output
             v1 v2 v3
[1,] -0.1239606  1  9
[2,]  0.2681838  2  8
[3,]  0.7268415  3  7
[4,]  0.2331354  4  6

         [,1]      [,2]      [,3]      [,4]
v1 -0.1239606 0.2681838 0.7268415 0.2331354
v2  1.0000000 2.0000000 3.0000000 4.0000000
v3  9.0000000 8.0000000 7.0000000 6.0000000

Matrices

Another way to generate matrices would be to put one giant vector into the matrix function. Of course, you will need to give matrix() a bit of help. You need to tell it something about the dimensions you’d like. This could be ncol for number of columns or nrow for number of rows. In addition, you should specify whether the vector is “by row” or not (i.e. “by column”).

If r1 denotes an element belonging on the first row, etc.:
A “by row” vector would be c(r1, r1, r1, r2, r2, r2, r3, r3, r3)
A “by column” vector would be c(r1, r2, r3, r1, r2, r3, r1, r2, r3)
Code
v_mat <- c(v1, v2, v3) # combine the three vectors
matrix(v_mat, ncol = 3); cat("\n")
matrix(v_mat, nrow = 3, byrow = TRUE)
Output
           [,1] [,2] [,3]
[1,] -0.1239606    1    9
[2,]  0.2681838    2    8
[3,]  0.7268415    3    7
[4,]  0.2331354    4    6

           [,1]      [,2]      [,3]      [,4]
[1,] -0.1239606 0.2681838 0.7268415 0.2331354
[2,]  1.0000000 2.0000000 3.0000000 4.0000000
[3,]  9.0000000 8.0000000 7.0000000 6.0000000
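The random numbers above can make the fill order hard to see. Here is the same idea sketched with the integers 1 through 6, where the order is easier to follow. The names v, by_col, and by_row are made up for this example.

```r
v <- 1:6

# Default: fill down the columns, one column at a time
by_col <- matrix(v, nrow = 2)
by_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE: fill across the rows, one row at a time
by_row <- matrix(v, nrow = 2, byrow = TRUE)
by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```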

Matrices

Once you have the matrix of your dreams, you may need to access certain columns or rows. Remember: example_mat[rows, columns]. For vectors, if you want the first element, you would use example_vec[1]. For a matrix, example_mat[1,] will give you the first row, example_mat[,1] will give you the first column, and example_mat[i,j] will give you the element in the i\(^{th}\) row and j\(^{th}\) column. To select multiple rows, you can use logic or indices, much like vectors.

Code
matrix(v_mat, ncol = 3) -> mat
mat; cat("\n") # whole matrix
mat[1,]; cat("\n") # first row
mat[,2]; cat("\n") # second column
mat[1,2]; cat("\n") # first row, second column
mat[c(1, 3),]; cat("\n") # first and third row
# Select all rows where the elements in the first column are positive.
# mat[,1] > 0 will return boolean values, and mat[,] will return the rows with TRUE values
mat[ mat[,1] > 0 ,]
Output
           [,1] [,2] [,3]
[1,] -0.1239606    1    9
[2,]  0.2681838    2    8
[3,]  0.7268415    3    7
[4,]  0.2331354    4    6

[1] -0.1239606  1.0000000  9.0000000

[1] 1 2 3 4

[1] 1

           [,1] [,2] [,3]
[1,] -0.1239606    1    9
[2,]  0.7268415    3    7

          [,1] [,2] [,3]
[1,] 0.2681838    2    8
[2,] 0.7268415    3    7
[3,] 0.2331354    4    6

Lists

Lists are just like vectors, except each element can be a different type. In fact, each element of a list can be an entire vector! Lists, for this reason, are incredibly flexible. Sometimes, this can even make it difficult to work with lists.

Code
list(1, 2, "3"); cat("\n")
list(1, 2, c("3", "4", "5")); cat("\n")
list(1, 2, list(3, 4, c("5", "6")))
Output
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] "3"


[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] "3" "4" "5"


[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[[3]][[1]]
[1] 3

[[3]][[2]]
[1] 4

[[3]][[3]]
[1] "5" "6"

Lists

An interesting feature of lists is that you can name the elements within the list. This is possible with vectors as well, but not as useful. Here are some examples of naming and using the names within lists.

Code
list(person1 = c("alex", "cardazzi"),
     person2 = c("jalen", "brunson"),
     person3 = c("thom", "yorke")) -> list_people
list_people[1]   # single square bracket...
list_people[[1]] # double square bracket...
list_people$person1 # or, you can use $ for double square bracket
Output
$person1
[1] "alex"     "cardazzi"

[1] "alex"     "cardazzi"
[1] "alex"     "cardazzi"
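The outputs above look almost the same, so it can be hard to tell what the difference between single and double square brackets is. One way to see it, sketched below, is to ask R for the class of each result.

```r
list_people <- list(person1 = c("alex", "cardazzi"),
                    person2 = c("jalen", "brunson"),
                    person3 = c("thom", "yorke"))

class(list_people[1]) # "list": a smaller list that holds one element
class(list_people[[1]]) # "character": the vector itself

# Because [[ ]] and $ return the vector, you can index into it right away
list_people[[1]][2] # "cardazzi"
list_people$person3[1] # "thom"
```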

Lists

Changing the format of the list a little bit:

Code
list(first = c("alex", "jalen", "thom"),
     last = c("cardazzi", "brunson", "yorke")) -> namez
namez$first
namez$last[2:3]
Output
[1] "alex"  "jalen" "thom" 
[1] "brunson" "yorke"  

This is a special list because both vectors of the list have the same number of elements, or observations. Effectively, this is what a data.frame is. Really, it is what a spreadsheet is: a collection of columns all with the same number of rows.

data.frame

So, what do data.frames look like?

Code
list(first = c("alex", "jalen", "thom"),
     last = c("cardazzi", "brunson", "yorke")) -> namez
as.data.frame(namez) -> namez_df
namez_df
Output
  first     last
1  alex cardazzi
2 jalen  brunson
3  thom    yorke

data.frame

Observations can be accessed in data.frames via the $ or [. These objects combine lists and matrices to make a more realistic view of the types of data that are most common in the real world.

Code
data.frame(first = c("alex", "jalen", "thom"),
     last = c("cardazzi", "brunson", "yorke"),
     num_of_albums = c(0, 0, 10),
     nba_seasons = c(0, 5, 0),
     phds = c(1, 0, 0),
     birth_country = c("us", "us", "uk")) -> df
df
Output
  first     last num_of_albums nba_seasons phds birth_country
1  alex cardazzi             0           0    1            us
2 jalen  brunson             0           5    0            us
3  thom    yorke            10           0    0            uk

data.frame

Suppose you want to subset the df object that you’ve created. Again, there are different ways to do this. Like matrices, to get some rows and all columns, you would use df[lim,] where lim is a vector of boolean values or indices. Leaving nothing following the comma indicates to R that you want everything in that dimension. To get columns, you can reverse this (df[,3:4]) or use names (df[,c("nba_seasons", "phds")]). If you only want a single column, of course, you can use df$phds.
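The subsetting options described above can be sketched as follows. The df object is re-created here so the example runs on its own.

```r
df <- data.frame(first = c("alex", "jalen", "thom"),
                 last = c("cardazzi", "brunson", "yorke"),
                 num_of_albums = c(0, 0, 10),
                 nba_seasons = c(0, 5, 0),
                 phds = c(1, 0, 0),
                 birth_country = c("us", "us", "uk"))

df[c(1, 3), ] # rows 1 and 3, all columns
df[df$birth_country == "us", ] # rows where the condition is TRUE
df[, 3:4] # columns 3 and 4, all rows
df[, c("nba_seasons", "phds")] # the same columns, selected by name
df$phds # a single column, as a vector

# Rows and columns together: the last name of anyone with NBA seasons
df[df$nba_seasons > 0, "last"]
```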

data.frame

As a final note about data.frames, here are a few important functions:

  • nrow(): Returns the number of rows in a data.frame.
  • ncol(): Returns the number of columns in a data.frame.
  • colnames(): Returns the names of columns in a data.frame.
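As a quick sketch, here are these functions applied to a small data.frame (a made-up, two-column version of the one from before, called small_df):

```r
small_df <- data.frame(first = c("alex", "jalen", "thom"),
                       last = c("cardazzi", "brunson", "yorke"))

nrow(small_df) # 3
ncol(small_df) # 2
colnames(small_df) # "first" "last"
```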

Writing a Script

To write a new script, click on the top left button underneath “File”. You should be able to see a white paper icon with a green +. This will open up a menu of different files. Just select “R Script” for now.

In the near future, we will learn to use R Markdown and/or Quarto Documents as well.

Writing a Script

Now, you’re ready to write your first RScript… now what?

For beginners, as a template, the top of your script should look like the following:

Code
library("")

rm(list = ls())
setwd("")
  • library(""): This is where you load any packages you might want. We will discuss packages later.
  • rm(list = ls()): This is how you clear your environment. It is a good idea to start with an empty environment so you don’t get confused between what is old and what is new.
  • setwd(""): This is where you set your working directory

Working Directories

  • Computers have different folders, sometimes called directories. For example, you might have a folder on your computer called “Documents”. To stay organized, you might make a folder inside “Documents” called “Econ 311” where you will put all of your stuff for this course. You make more folders inside this folder called “HW”, “Lectures”, etc.
  • Suppose you save a file to your “HW” folder, and you name the file “HW01.html”. This file’s path looks like Documents/Econ 311/HW/HW01.html.
  • Now, suppose you have some data in this folder, and you want R to find it and read it. Well, you can’t tell R just the name of the file because it doesn’t know what directory to look in.
  • So, to tell R, you can do one of two things:
    1. Supply the file’s path along with its name.
    2. Tell R where all the files you’re working on will live.
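To get a feel for what a path is made of, here is a small sketch using a few of R's path helpers. It reuses the example path from above, and the name hw_path is made up. The file probably does not exist on your machine, so expect file.exists() to return FALSE.

```r
# Build the path piece by piece; file.path() inserts the "/" separators
hw_path <- file.path("Documents", "Econ 311", "HW", "HW01.html")
hw_path # "Documents/Econ 311/HW/HW01.html"

basename(hw_path) # "HW01.html": just the file's name
dirname(hw_path) # "Documents/Econ 311/HW": the folder that contains it

# file.exists() reports whether R can find the file from where it is looking
file.exists(hw_path)
```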

Working Directories

R has two helpful functions for working directories. First, getwd() tells you where R is currently looking.

Code
getwd()
Output
[1] "C:/Users/alexc/Dropbox/teaching/Fall 2023/econ311/module01"

Then, if I wanted to change this, I would use setwd(). Here, I could either:

  1. Type in my entire new working directory
  2. Navigate to the new working directory

Working Directories

Code
# Two periods means "go back one level"
# So, if we were in "Documents/Econ 311/HW01"
# ".." would bring us to "Documents/Econ 311"
setwd("..")

# If there was another folder inside "HW01",
# for example, suppose you have a folder "Data" inside "HW01"
# From "HW01", you can navigate to "Data" like:
setwd("Data")

# Maybe you want to go from "HW01" to "HW02/Data"
# You would have to back out from HW01 (using "..")
# Then go into HW02 and Data
setwd("../HW02/Data")

Working Directories

Of course, to use the ".." trick, you need to know where you’re starting from (i.e. getwd()). If you are unsure, you could always type in your entire working directory in one go.

Code
setwd("C:/Users/alexc/Dropbox/teaching/Fall 2023/econ311/HW02/Data")

Reading Data

Now that R knows where to look, we need it to import our data so we can use it. Most of the time in this course, we will use files that end in .csv. This stands for “comma separated values”. .csv files are very common and require relatively small amounts of storage. .csv files are also open-able in Excel (you just might get some warning about how any Excel formulas you write will not be saved). To create a .csv from an .xlsx (Excel) file, just use “Save As” in Excel, and change the file extension to “Comma Separated Values (.csv)”.
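If you are curious what a .csv looks like under the hood, the sketch below writes a tiny, made-up data.frame to a temporary file and then reads it back. The names cars_df and tmp, along with the values, are for illustration only.

```r
# A tiny made-up data set
cars_df <- data.frame(year = c(1997, 1998), price = c(9925, 9991))

# Write it to a temporary .csv file
tmp <- tempfile(fileext = ".csv")
write.csv(cars_df, tmp, row.names = FALSE)

# The file is just text: one line for the column names, then one line
# per row, with the values separated by commas
readLines(tmp)

# Reading it back gives the same table
read.csv(tmp)
```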

Now, let’s read in a file called “ford_escort.csv”. On my machine, it is in the folder “data”. Rather than changing my working directory, I can just put the path of the file. We will use the read.csv() function to import our data.

Reading Data

Code
# My working directory is "econ311/module01",
# but the file is in "econ311/data".
# So, I need to back out of "module01" (using "..")
# and then navigate into the "data" folder.
ford <- read.csv("../data/ford_escort.csv")
dim(ford); cat("\n") # dim() gives the number of rows and columns
head(ford) # the head() function displays the first 6 rows.

# I could have also done:
# setwd("../data")
# ford <- read.csv("ford_escort.csv")
Output
[1] 23  3

  Year Mileage..thousands. Price
1 1998                  27  9991
2 1997                  17  9925
3 1998                  28 10491
4 1998                   5 10990
5 1997                  38  9493
6 1997                  36  9991

Manipulating Data

Now that we have our data read into R, we can manipulate it:

  1. Change the second column name from Mileage..thousands. to mileage.
  2. How many Escorts were less than $9000?
  3. Currently, mileage is in thousands. Multiply it by 1000, and save over the original variable.
  4. Find the cost per mile ($/mi) and save it as a new variable.
  5. Calculate the average cost per mile.
  6. Use the range() function to find the minimum and maximum $/mi.

As a note, you can use cat() to combine text and code. Put "\n" at the end to make a new line. You can experiment with this on your own.
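Here is a quick sketch of cat() to get you started. The scores vector is made up.

```r
scores <- c(80, 90, 100) # made-up numbers

# cat() prints each piece in order, separated by spaces
cat("Average score:", mean(scores), "\n")

# It also works with results that contain more than one value
cat("Lowest and highest:", range(scores), "\n")
```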

Manipulating Data

Code
# colnames(ford) # Take a look at the column names.
colnames(ford)[2] <- "mileage" # change the second name

# ford$Price < 9000 # this gives boolean values.
# Since R treats TRUE as 1 and FALSE as 0, use sum()
cat("Number of Escorts less than $9,000:", sum(ford$Price < 9000), "\n")

ford$mileage <- ford$mileage * 1000 # Multiply by 1000 and save
ford$cost_per_mile <- ford$Price / ford$mileage # Create $/mi

cat("Average Cost per Mile:", mean(ford$cost_per_mile), "\n") # average
cat("Range of Cost per Mile:", range(ford$cost_per_mile)) # min and max
Output
Number of Escorts less than $9,000: 5 
Average Cost per Mile: 0.4315369 
Range of Cost per Mile: 0.08325 2.198

Plotting

An especially attractive feature of R is its powerful graphics. Just Google “Best R Plots”, or something, and you’ll see what I mean.

To start, we’ll learn some of the basics. We will begin by generating two scatter plots using data from ford.

  1. Plot mileage vs Price.
  2. Plot mileage vs cost_per_mile.

Plotting

Code
plot(ford$mileage, ford$Price)
Plot

Scatterplot of mileage on the x-axis and price on the y-axis. There appears to be a negative relationship.

Plotting

Below are some examples of additional arguments for the plot() function.

  1. las = 1 makes the axis labels horizontal, which is most noticeable on the y-axis. Other values (0, 2, or 3) orient the labels differently.
  2. col sets the colors used in the plot. This can take a vector of colors.
  3. pch sets the type of point used.
  4. cex sets the size of the points. The default is 1.
  5. main sets the title of the plot.
  6. xlab sets the name of the x-axis
  7. ylab sets the name of the y-axis
Code
plot(ford$mileage, ford$cost_per_mile, las = 1,
     pch = 23, cex = 1.2,
     col = "tomato", main = "Cost vs Mileage",
     xlab = "Mileage", ylab = "Price per Mile")
Plot

Scatterplot of mileage on the x-axis and price per mile on the y-axis. There is a strong, non-linear relationship.

Plotting

We can also add reference lines to the plot, and also make the colors a bit more complex.

Code
# Set all colors as "tomato"
ford$point_color <- "tomato"
# If the Year is less than the mean year, color it "dodgerblue"
# Of course, these are therefore the "older" cars
ford$point_color[ford$Year < mean(ford$Year)] <- "dodgerblue"
plot(ford$mileage, ford$cost_per_mile, las = 1,
     pch = 19, cex = 1.2,
     col = ford$point_color, main = "Cost vs Mileage",
     xlab = "Mileage", ylab = "Price per Mile")
abline(h = 1) # horiz. line at Y = 1
abline(v = mean(ford$mileage)) # vert. line at the mean of X
Plot

Scatterplot of mileage on the x-axis and price per mile on the y-axis. There is a strong, non-linear relationship.

Plotting

Of course, whenever you choose to add some differences in shapes, colors, etc., it’s helpful to add a legend to your plot. To do this, we can use the legend() function. This function accepts a few important arguments:

  • bty: setting this to "n" removes the box around the legend. I always use this option.
  • legend: this is the actual text to be displayed in the legend. It accepts a character vector, so if you colored your plot by men and women, you would use c("Men", "Women").
  • x, y: You can specify the exact coordinates of your legend, or you can specify things like: "topleft", "topright", "bottomleft", or "bottomright".
  • horiz: this accepts a boolean value, and turns the legend from vertical to horizontal.
  • Then, you will need to specify either pch or lty options to tell R if you want to display points or lines next to your legend.

Plotting

Below is a plot with two legends (which is certainly redundant) to show off some of the different ways to customize the output.

Code
plot(ford$mileage, ford$cost_per_mile, las = 1,
     pch = 19, cex = 1.2,
     col = ford$point_color, main = "Cost vs Mileage",
     xlab = "Mileage", ylab = "Price per Mile")
legend("topright", pch = 19, bty = "n", horiz = TRUE,
       legend = c("Old Ford", "New Ford"), cex = 1.5,
       col = c("dodgerblue", "tomato"))
legend("bottomleft", lty = c(1, 2), pch = c(2, 19),
       legend = c("Old Ford", "New Ford"),
       col = c("dodgerblue", "tomato"))
Plot

Same plot as before, but with one legend in the top right and another in the bottom left.

Plotting

When generating figures, you will sometimes need to add data from a different source to the same set of axes. As an example, let’s simply plot the data above, but in two steps instead of one.

To do this, we will use points(). This function accepts nearly every argument plot() does, except you are unable to impact the axes/labels of the plot.

Code
plot(ford$mileage[ford$point_color == "tomato"],
     ford$cost_per_mile[ford$point_color == "tomato"],
     las = 1, pch = 19, cex = 1.2,
     col = "tomato", main = "Cost vs Mileage",
     xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color != "tomato"],
       ford$cost_per_mile[ford$point_color != "tomato"],
       pch = 19, cex = 1.2, col = "dodgerblue")
Plot

Plot of mileage on the x-axis and cost per mile on the y-axis.

Plotting

Once you get the hang of using plot() and points() in tandem, you’ll find it convenient that points() does not impact the axes. However, to start, this will be annoying. For example, let’s switch the order of the data in plot() and points().

Code
plot(ford$mileage[ford$point_color != "tomato"],
     ford$cost_per_mile[ford$point_color != "tomato"],
     las = 1, pch = 19, cex = 1.2,
     col = "dodgerblue", main = "Cost vs Mileage",
     xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color == "tomato"],
       ford$cost_per_mile[ford$point_color == "tomato"],
       pch = 19, cex = 1.2, col = "tomato")
Plot

Same plot as before, except this plot is significantly cut off. All of the blue dots, the ones where the model year is below the mean, are showing, but many red dots are cut off.

The plot is different because when plot() is setting the axes, it doesn’t know that you’re planning on using points() next. So, it scales the axes so the data fed into plot() “fits” the space.

Plotting

To overcome this, we can use the following trick:

Code
plot(0, 0, type = "n",
     ylim = range(ford$cost_per_mile),
     xlim = range(ford$mileage), # range can include multiple vectors
     main = "Cost vs Mileage", las = 1,
     xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color != "tomato"],
       ford$cost_per_mile[ford$point_color != "tomato"],
       pch = 19, cex = 1.2, col = "dodgerblue")
points(ford$mileage[ford$point_color == "tomato"],
       ford$cost_per_mile[ford$point_color == "tomato"],
       pch = 19, cex = 1.2, col = "tomato")
Plot

Plot of mileage on the x-axis and cost per mile on the y-axis.

Of course, this is a lot more coding than the initial plot’s code. The idea of showing you this is that, now, you can always make sure your data “fits”. This is one of the little things that I use constantly, but it took me a long time to figure out.
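The comment in the code above mentions that range() can take multiple vectors. This comes in handy when the data you plan to plot lives in separate objects. A minimal sketch, with made-up vectors group_a and group_b:

```r
group_a <- c(2, 5, 9)
group_b <- c(-1, 4, 6)

range(group_a) # 2 9
range(group_a, group_b) # -1 9: the min and max across both vectors
```

Passing something like range(group_a, group_b) to ylim or xlim sets the axes so that both sets of points fit.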

Plotting

Finally, adding lines to a plot is very similar in that one needs to use lines(). To illustrate, we will examine panel data on cigarette consumption by state (documentation).

I am going to read in the data and plot sales by year.

Code
cig <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Cigar.csv")
cig$year <- cig$year + 1900
plot(cig$year, cig$sales, las = 1,
     ylab = "Sales", xlab = "Year")
Plot

Plot of cigarette pack sales on the y-axis and time on the x-axis.

Plotting

This figure is very difficult to understand. Let’s trim it down to just a few states. In addition, we can add colors to the figure.

Code
cig <- cig[cig$state %in% 1:5,]
plot(cig$year, cig$sales, las = 1,
     # since state is a number,
     #  we can just use this as the color
     col = cig$state,
     ylab = "Sales", xlab = "Year")
Plot

Plot of cigarette pack sales on the y-axis and time on the x-axis. This time, however, only four states are displayed.

This plot can still be improved. It’d be a lot more natural to see the data as lines instead of points. To do this, we can use type = "l".

Plotting

Code
plot(cig$year, cig$sales, las = 1,
     col = cig$state, type = "l",
     ylab = "Sales", xlab = "Year")
Plot

Plot of cigarette pack sales on the y-axis and time on the x-axis. There are three diagonal lines that connect the last data point in the time series of one state to the first data point in the time series of another state.

Notice two things about this plot. First, there’s only a single color. In R, you should think of a line as a single object, much like a single point. R cannot color different parts of a line differently, so it will just take the first color (here, it’s 1, which is black). Second, there are these insane diagonal lines across the plot. This is because R wants to connect everything into a single line, so the last observation of one state gets joined to the first observation of the next.

Plotting

To fix this, we need to use lines like we used points before. This is an example case of a time where we’ll want to set up the axes before we plot anything.

Code
# before, I plotted 0, 0
# now, I am simply keeping the data
#   in plot().
# this way, I don't need to set the axes
#   via the ylim and xlim arguments
plot(cig$year, cig$sales,
     las = 1, type = "n",
     ylab = "Sales", xlab = "Year")
lines(cig$year[cig$state == 1],
      cig$sales[cig$state == 1],
      col = 1)
lines(cig$year[cig$state == 3],
      cig$sales[cig$state == 3],
      col = 3)
lines(cig$year[cig$state == 4],
      cig$sales[cig$state == 4],
      col = 4)
lines(cig$year[cig$state == 5],
      cig$sales[cig$state == 5],
      col = 5)
legend("bottomleft", ncol = 2,
       legend = c("State 1", "State 3", "State 4", "State 5"),
       bty = "n", col = c(1, 3, 4, 5), lty = 1)
Plot

A correct time series plot of the first four states, each colored differently.

Next module, we’ll learn about “loops”, which will cut down on the amount of code we need to write to generate these lines.